# Vision-Language Fusion
## DAM-3B Self-Contained

**Developer:** nvidia · **License:** Other · **Task:** Image-to-Text (English) · **Downloads:** 824 · **Likes:** 17

DAM-3B is a vision-language model that generates fine-grained, localized descriptions of user-specified image regions, which can be given as points, boxes, sketches, or masks.
## Gemma-3-4B-It-Abliterated (Q4_0 GGUF)

**Developer:** BernTheCreator · **Task:** Image-to-Text · **Downloads:** 160 · **Likes:** 1

A GGUF-format conversion of mlabonne/gemma-3-4b-it-abliterated, combined with the vision component of x-ray_alpha for a smoother multimodal experience.
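GGUF checkpoints like this one are typically run with llama.cpp or its Python bindings. Below is a minimal text-only sketch using llama-cpp-python; the local file name is hypothetical, and the vision path additionally requires the model's mmproj projector file, which this example omits.

```python
# Minimal text-only sketch using llama-cpp-python (pip install llama-cpp-python).
# The GGUF file name is hypothetical; download the actual file from the model page.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-3-4b-it-abliterated-Q4_0.gguf",  # hypothetical local path
    n_ctx=4096,       # context window
    n_gpu_layers=-1,  # offload all layers to GPU if available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what a GGUF file is."}]
)
print(out["choices"][0]["message"]["content"])
```

Q4_0 is a 4-bit quantization scheme, which is what makes a 4B multimodal model practical on CPU or small-GPU machines.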
## Diagram-to-Code Agent

**Developer:** DiagramAgent · **License:** Apache-2.0 · **Task:** Image-to-Text (English) · **Downloads:** 51 · **Likes:** 0

A vision-language fusion model designed specifically to convert diagrams into structured code.
## ColPali v1.3

**Developer:** vidore · **License:** MIT · **Task:** Text-to-Image (English) · **Downloads:** 96.60k · **Likes:** 40

ColPali is a visual retrieval model that combines a PaliGemma-3B backbone with the ColBERT late-interaction strategy to index documents efficiently from their visual features.
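ColBERT-style late interaction means queries and pages are each embedded as a bag of token-level vectors and scored with a MaxSim sum rather than a single dot product. A minimal retrieval sketch follows, assuming the colpali-engine package; the method names follow that project's published examples, and the file names are hypothetical.

```python
# Sketch of indexing and scoring with ColPali, assuming the colpali-engine
# package (pip install colpali-engine); method names follow its examples.
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.3"
model = ColPali.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

pages = [Image.open("page_001.png")]       # hypothetical scanned document pages
queries = ["quarterly revenue table"]

with torch.no_grad():
    page_emb = model(**processor.process_images(pages).to(model.device))
    query_emb = model(**processor.process_queries(queries).to(model.device))

# Late-interaction (MaxSim) scoring: one relevance score per (query, page) pair.
scores = processor.score_multi_vector(query_emb, page_emb)
print(scores)
```

Because pages are indexed directly as images, this avoids an OCR-plus-chunking pipeline entirely, which is the model's main selling point for document retrieval.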
## ChemVLM-8B

**Developer:** AI4Chem · **License:** Apache-2.0 · **Task:** Image-to-Text · **Framework:** Transformers · **Downloads:** 117 · **Likes:** 6

ChemVLM-8B is an 8-billion-parameter multimodal large language model specialized for the chemistry domain, capable of processing both textual and visual chemical information.
## ColPali

**Developer:** vidore · **License:** MIT · **Task:** Text-to-Image (English) · **Downloads:** 12.88k · **Likes:** 436

The base release of the ColPali visual retrieval model (see ColPali v1.3 above): a PaliGemma-3B backbone with the ColBERT strategy for efficient document indexing from visual features.
## MMAlaya

**Developer:** DataCanvas · **License:** Apache-2.0 · **Task:** Image-to-Text · **Framework:** Transformers · **Downloads:** 31 · **Likes:** 1

MMAlaya is a multimodal system built on the Alaya large language model, comprising three core components: the language model itself, an image-text feature encoder, and a feature transformation module.
## LLaVA-Plus v0 7B

**Developer:** LLaVA-VL · **Task:** Text-to-Image · **Framework:** Transformers · **Downloads:** 79 · **Likes:** 38

LLaVA-Plus is a large language and vision assistant that learns to use pluggable skills (external tools), intended primarily for academic research on multimodal models and chatbots.
## LLaVA v1.5 13B LoRA

**Developer:** liuhaotian · **Task:** Text-to-Image · **Framework:** Transformers · **Downloads:** 143 · **Likes:** 26

LLaVA is an open-source multimodal chatbot fine-tuned from LLaMA/Vicuna on GPT-generated multimodal instruction-following data; this checkpoint is the 13B LoRA variant.
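The listed checkpoint is a LoRA adapter for the original LLaVA training/inference codebase; the quickest way to try LLaVA 1.5 from Python is usually the transformers-ported llava-hf weights instead. A minimal inference sketch under that assumption:

```python
# Sketch of LLaVA 1.5 inference via transformers. Note: this loads the
# llava-hf port of the 13B weights, not the liuhaotian LoRA checkpoint,
# which targets the original LLaVA codebase.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-13b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("photo.jpg")  # hypothetical local image
prompt = "USER: <image>\nDescribe this picture in one sentence. ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=80)
print(processor.decode(output[0], skip_special_tokens=True))
```

The `<image>` placeholder in the prompt marks where the processor splices in the vision tokens, following LLaVA 1.5's USER/ASSISTANT chat template.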